Abstract

In statistical modeling, the Akaike information criterion (AIC) is a widely
known and extensively used tool for model choice. The $\phi$-divergence test
statistic is a more recently developed tool for statistical model selection. The
popularity of divergence criteria is, however, tempered by their known lack of
robustness in small samples. In this paper, penalized minimum Hellinger distance
type statistics are considered and some of their properties are established. The
limit laws of the estimates and test statistics are given under both the null and
the alternative hypotheses, and approximations of the power functions are deduced.
A model selection criterion based on these divergence measures is developed for
parametric inference. Our interest is in the problem of testing for a choice
between two models using informational type statistics when independent samples
are drawn from a discrete population. We discuss the asymptotic properties and
the performance of the new test procedures and investigate their small-sample
behavior.

Key words: Generalized information, estimation, hypothesis test, Monte Carlo 
simulation. 
1 Introduction

Comprehensive surveys of Pearson chi-square type statistics have been provided
by many authors, such as Cochran (1952), Watson (1956) and Moore (1978, 1986),
in particular on quadratic forms in the cell frequencies. Recently, Andrews
(1988a, 1988b) has extended the Pearson chi-square testing method to non-dynamic
parametric models, i.e., to models with covariates. Because Pearson chi-square
statistics provide natural measures for the discrepancy between the observed
data and a specific parametric model, they have also been used for discriminating
among competing models. Such a situation is frequent in the social sciences,
where many competing models are proposed to fit a given sample. A well-known
difficulty is that each chi-square statistic tends to become large, without an
increase in its degrees of freedom, as the sample size increases. As a consequence,
goodness-of-fit tests based on Pearson type chi-square statistics will generally
reject the correct specification of every competing model.

To circumvent this difficulty, a popular method for model selection, which
is similar to the use of the Akaike (1973) Information Criterion (AIC), consists
in considering that the lower the chi-square statistic, the better the model.
This selection rule, however, does not take into account the random
variations inherent in the values of the statistics.

We propose here a procedure for taking into account the stochastic nature of
these differences so as to assess their significance. The main purpose of this
paper is to address this issue. We shall propose some convenient asymptotically
standard normal tests for model selection based on $\phi$-divergence type
statistics. Following Vuong (1989, 1993), the procedures considered here test
the null hypothesis that the competing models are equally close to
the data generating process (DGP) against the alternative hypothesis that one
model is closer to the DGP, where closeness of a model is measured according
to the discrepancy implicit in the $\phi$-divergence type statistic used. Thus the
outcomes of our tests provide information on the strength of the statistical
evidence for the choice of a model based on its goodness-of-fit. The model
selection approach proposed here differs from those of Cox (1961, 1962) and
Akaike (1974) for non-nested hypotheses. The difference is that the present
approach is based on the discrepancy implicit in the divergence type statistics
used, while these other approaches, like Vuong's (1989) tests for model selection,
rely on the Kullback-Leibler (1951) information criterion (KLIC).
Beran (1977) showed that, by using the minimum Hellinger distance estimator,
one can simultaneously obtain asymptotic efficiency and robustness properties
in the presence of outliers. The works of Simpson (1989) and Lindsay (1994)
have shown that, in hypothesis testing, robust alternatives to the likelihood
ratio test can be generated by using the Hellinger distance. We consider a
general class of estimators that is very broad and contains most of the estimators
currently used in practice when forming divergence type statistics. This covers
the cases studied in Harris and Basu (1994), Basu et al. (1996) and Basu and Basu
(1998), where the penalized Hellinger distance is used.

The remainder of this paper is organized as follows. Section 2 introduces the
basic notation and definitions. Section 3 gives a short overview of divergence
measures. Section 4 investigates the asymptotic distribution of the penalized
Hellinger distance. In Section 5, some applications to hypothesis testing are
proposed. Section 6 presents some simulation results. Section 7 concludes the
paper.



2 Definitions and notation 

In this section, we briefly present the basic assumptions on the model and the
parameter estimators, and we define our generalized divergence type statistics.
We consider a discrete statistical model, i.e., $X_1, X_2, \ldots, X_n$ is an
independent random sample from a discrete population with support
$\mathcal{X} = \{1, \ldots, m\}$. Let $P = (p_1, \ldots, p_m)^T$ be a probability
vector, i.e., $P \in \Omega_m$, where $\Omega_m$ is the simplex of probability
$m$-vectors,

$$\Omega_m = \Big\{ (p_1, p_2, \ldots, p_m) \in \mathbb{R}^m : \sum_{i=1}^{m} p_i = 1,\ p_i \geq 0,\ i = 1, \ldots, m \Big\}.$$

We consider a parametric model

$$\mathcal{P} = \big\{ P_\theta = (p_1(\theta), \ldots, p_m(\theta))^T : \theta \in \Theta \big\}$$

which may or may not contain the true distribution $P$, where $\Theta$ is a compact
subset of $k$-dimensional Euclidean space (with $k < m - 1$). If $\mathcal{P}$
contains $P$, then there exists a $\theta_0 \in \Theta$ such that $P_{\theta_0} = P$
and the model $\mathcal{P}$ is said to be correctly specified.

We are interested in testing

$H_0 : P \in \mathcal{P}$ (with true parameter $\theta_0$) versus $H_1 : P \in \Omega_m - \mathcal{P}$.

By $\|\cdot\|$ we denote the usual Euclidean norm and we interpret probability
distributions on $\mathcal{X}$ as row vectors from $\mathbb{R}^m$. For simplicity
we restrict ourselves to unknown true parameters $\theta_0$ satisfying the
classical regularity conditions given by Birch (1964):

1. The true $\theta_0$ is an interior point of $\Theta$ and $p_{i\theta_0} > 0$ for
$i = 1, \ldots, m$. Thus $P_{\theta_0} = (p_{1\theta_0}, \ldots, p_{m\theta_0})$ is
an interior point of the set $\Omega_m$.

2. The mapping $P : \Theta \to \Omega_m$ is totally differentiable at $\theta_0$, so
that the partial derivatives of $p_i$ with respect to each $\theta_j$ exist at
$\theta_0$ and $p_i(\theta)$ has a linear approximation at $\theta_0$ given by

$$p_i(\theta) = p_i(\theta_0) + \sum_{j=1}^{k} (\theta_j - \theta_{0j}) \frac{\partial p_i(\theta_0)}{\partial \theta_j} + o(\|\theta - \theta_0\|)$$

where $o(\|\theta - \theta_0\|)$ denotes a function verifying
$\lim_{\theta \to \theta_0} \frac{o(\|\theta - \theta_0\|)}{\|\theta - \theta_0\|} = 0$.

3. The Jacobian matrix
$J(\theta_0) = \left(\frac{\partial P_\theta}{\partial \theta}\right)_{\theta = \theta_0} = \left(\frac{\partial p_i(\theta_0)}{\partial \theta_j}\right)_{1 \leq i \leq m,\ 1 \leq j \leq k}$
is of full rank (i.e., of rank $k$ and $k < m$).

4. The inverse mapping $P^{-1} : \mathcal{P} \to \Theta$ is continuous at $P_{\theta_0}$.

5. The mapping $P : \Theta \to \Omega_m$ is continuous at every point $\theta \in \Theta$.



Under the hypothesis that $P \in \mathcal{P}$, there exists an unknown parameter
$\theta_0$ such that $P = P_{\theta_0}$ and the problem of point estimation appears
in a natural way. Let $n$ be the sample size. We can estimate the distribution
$P_{\theta_0} = (p_1(\theta_0), p_2(\theta_0), \ldots, p_m(\theta_0))^T$ by the vector
of observed frequencies $\hat P = (\hat p_1, \ldots, \hat p_m)$ on $\mathcal{X}$,
i.e., by a measurable mapping $\mathcal{X}^n \to \Omega_m$.
This nonparametric estimator $\hat P = (\hat p_1, \ldots, \hat p_m)$ is defined by

$$\hat p_j = \frac{N_j}{n}, \quad N_j = \sum_{i=1}^{n} T_j(X_i), \quad \text{where } T_j(X_i) = \begin{cases} 1 & \text{if } X_i = j \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

We can now define the class of $\phi$-divergence type statistics considered in this
paper.
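As an illustration of the nonparametric estimator in (2.1), the following minimal sketch computes the vector of observed frequencies for a sample with support {1, ..., m}. The function name `empirical_frequencies` and the toy sample are assumptions made for this example only; they do not come from the paper.

```python
from collections import Counter


def empirical_frequencies(sample, m):
    """Return (p_hat_1, ..., p_hat_m) with p_hat_j = N_j / n on the support {1, ..., m}."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)  # N_j = number of observations equal to j
    return [counts.get(j, 0) / n for j in range(1, m + 1)]


# Hypothetical sample of size n = 6 on the support {1, 2, 3, 4}.
p_hat = empirical_frequencies([1, 3, 3, 2, 3, 1], m=4)
print(p_hat)
```

Cells that receive no observation get frequency zero; these are the empty cells on which the penalized Hellinger distance of Section 3 differs from the ordinary one.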



3 A brief review of $\phi$-divergences



Many different measures quantifying the degree of discrimination between two
probability distributions have been studied in the past. They are frequently
called distance measures, although some of them are not strictly metrics. They
have been applied to different areas, such as medical image registration (Josien
P. W. Pluim, 2001), classification and retrieval, among others. This class of
distances is referred to in the literature as the class of $\phi$-, $f$- or
$g$-divergences (Csiszár (1967); Vajda (1989); Morales et al. (1995); Pardo (2006);
Bassetti et al. (2007)) or the class of disparities (Lindsay (1994)). Divergence
measures play an important role in statistical theory, especially in large-sample
theories of estimation and testing.

Many papers have since appeared in the literature in which divergence or entropy
type measures of information are used in testing statistical hypotheses.
Among others we refer to McCulloch (1988), Read and Cressie
(1988), Zografos et al. (1990), Salicrú et al. (1994), Bar-Hen and Daudin
(1995), Menéndez et al. (1995, 1996, 1997), Pardo et al. (1995), Morales et
al. (1997, 1998), Zografos (1994, 1998), Bar-Hen (1996) and the references
therein. A measure of discrimination between two probability distributions,
called $\phi$-divergence, was introduced by Csiszár (1967).

Recently, Broniatowski et al. (2009) presented a new dual representation for
divergences. Their aim was to introduce estimation and test procedures through
divergence optimization for discrete or continuous parametric models. For the
problem where independent samples are drawn from two different discrete
populations, Basu et al. (2010) developed some tests based on the Hellinger
distance and penalized versions of it.

Consider two populations $X$ and $Y$ which, according to some classification
criteria, can be grouped into $m$ classes (species) $x_1, x_2, \ldots, x_m$ and
$y_1, y_2, \ldots, y_m$ with probabilities $P = (p_1, p_2, \ldots, p_m)$ and
$Q = (q_1, q_2, \ldots, q_m)$ respectively. Then

$$D_\phi(P, Q) = \sum_{i=1}^{m} q_i\, \phi\!\left(\frac{p_i}{q_i}\right) \qquad (3.2)$$

is the $\phi$-divergence between $P$ and $Q$ (see Csiszár, 1967) for every $\phi$ in
the set $\Phi$ of real convex functions defined on $[0, \infty[$. The function
$\phi(t)$ is assumed to verify the following regularity condition:

$\phi : [0, +\infty[ \to \mathbb{R} \cup \{\infty\}$ is convex and continuous, where
$0\,\phi(0/0) = 0$ and $0\,\phi(p/0) = p \lim_{u \to \infty} \phi(u)/u$. Its
restriction to $]0, +\infty[$ is finite and twice continuously differentiable in a
neighborhood of $u = 1$, with $\phi(1) = \phi'(1) = 0$ and $\phi''(1) = 1$
(cf. Liese and Vajda (1987)).
We shall also be interested in parametric estimators

$$\hat Q = \hat Q_n = P_{\hat\theta} \qquad (3.3)$$

of $P_{\theta_0}$ which can be obtained by means of various point estimators

$$\hat\theta = \hat\theta^{(n)} : \mathcal{X}^{n} \to \Theta$$

of the unknown parameter $\theta_0$.

It is convenient to measure the difference between the observed frequencies
$\hat P$ and the expected frequencies $P_\theta$. A minimum divergence estimator of
$\theta$ is a minimizer of $D_\phi(\hat P, P_\theta)$, where $\hat P$ is a
nonparametric distribution estimate. In our case, where data come from a discrete
distribution, the empirical distribution defined in (2.1) can be used.



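In particular, if we take $\phi_1(x) = -4\left[\sqrt{x} - \frac{1}{2}(x + 1)\right]$
in (3.2), we get the Hellinger distance between the distributions $\hat P$ and
$P_\theta$, given by

$$D_{\phi_1}(\hat P, P_\theta) = HD(\hat P, P_\theta) = 2 \sum_{i=1}^{m} \left(\hat p_i^{1/2} - p_i^{1/2}(\theta)\right)^2 ; \quad \theta \in \Theta \qquad (3.4)$$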
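The link between the general definition (3.2) and the closed form (3.4) can be checked numerically. The sketch below evaluates the $\phi$-divergence with the generating function $\phi_1$ and compares it with $2\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$. All function names and the two probability vectors are hypothetical, chosen only for illustration, and the generic routine assumes every $q_i > 0$.

```python
import math


def phi_divergence(p, q, phi):
    """D_phi(P, Q) = sum_i q_i * phi(p_i / q_i); assumes every q_i > 0."""
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q))


def phi_hellinger(x):
    """phi_1(x) = -4 * (sqrt(x) - (x + 1) / 2), which equals 2 * (sqrt(x) - 1)^2."""
    return -4.0 * (math.sqrt(x) - (x + 1.0) / 2.0)


def hellinger(p, q):
    """Closed form 2 * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 2.0 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))


# Hypothetical observed and model probability vectors (the last observed cell is empty).
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.3, 0.2, 0.1]
d_generic = phi_divergence(p, q, phi_hellinger)
d_closed = hellinger(p, q)
print(d_generic, d_closed)
```

The two values agree up to rounding because $q\,\phi_1(p/q) = 2(\sqrt{p} - \sqrt{q})^2$ term by term, including for an empty observed cell, where both sides reduce to $2q$.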



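Liese and Vajda (1987), Lindsay (1994) and Morales et al. (1995) introduced
the so-called minimum $\phi$-divergence estimate, defined by

$$D_\phi(\hat P, P_{\hat\theta_\phi}) = \min_{\theta \in \Theta} D_\phi(\hat P, P_\theta) ; \quad \phi \in \Phi. \qquad (3.5)$$

$$\hat\theta_\phi = \arg\min_{\theta \in \Theta} D_\phi(\hat P, P_\theta) ; \quad \phi \in \Phi. \qquad (3.6)$$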


Remark 3.1 The class of estimates (3.4) contains the maximum likelihood
estimator (MLE).

In particular, if we take $\phi(x) = -\log x + x - 1$ we get

$$\hat\theta_{KL_m} = \arg\min_{\theta \in \Theta} KL_m(P_\theta, \hat P) = \arg\min_{\theta \in \Theta} \sum_{i=1}^{m} -\log\big(p_i(\theta)\big)\, \hat p_i = \text{MLE}$$

where $KL_m$ is the modified Kullback-Leibler divergence.

Beran (1977) first pointed out that the minimum Hellinger distance estimator
(MHDE) of $\theta$, defined by

$$\hat\theta_H = \arg\min_{\theta \in \Theta} HD(\hat P, P_\theta) \qquad (3.7)$$

has robustness properties.

Further results were given by Tamura and Boos (1986), Simpson (1987) and
Donoho and Liu (1988); see Simpson (1987, 1989) and Basu et al. (1997) for more
details on this method of estimation. Simpson, however, noted that the small-sample
performance of the Hellinger deviance test at some discrete models
such as the Poisson is somewhat unsatisfactory, in the sense that the test
requires a very large sample size for the chi-square approximation to be useful
(Simpson (1989), Table 3). In order to avoid this problem, one possibility is to
use the penalized Hellinger distance (see Harris and Basu (1994); Basu et al.
(1996); Basu and Basu (1998); Basu et al. (2010)). The penalized Hellinger
distance family between the probability vectors $\hat P$ and $P_\theta$ is defined by:

$$PHD_h(\hat P, P_\theta) = 2\left[\sum_{i \in \omega} \left(\hat p_i^{1/2} - p_i^{1/2}(\theta)\right)^2 + h \sum_{i \in \omega^c} p_i(\theta)\right] \qquad (3.8)$$

where $h$ is a real positive number, with $\omega = \{i : \hat p_i \neq 0\}$ and
$\omega^c = \{i : \hat p_i = 0\}$. Note that when $h = 1$, this generates the
ordinary Hellinger distance (Simpson, 1989).
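The role of the penalty weight $h$ can be made concrete with a minimal sketch of the penalized Hellinger distance, assuming the form $2[\sum_{i\in\omega}(\hat p_i^{1/2} - p_i^{1/2}(\theta))^2 + h\sum_{i\in\omega^c} p_i(\theta)]$, where $\omega$ is the set of non-empty cells. The function name and the numerical vectors are hypothetical.

```python
import math


def penalized_hellinger(p_hat, p_model, h=1.0):
    """Penalized Hellinger distance between observed and model cell probabilities.

    Non-empty cells (p_hat_i > 0) contribute (sqrt(p_hat_i) - sqrt(p_i))^2;
    empty cells (p_hat_i == 0) contribute h * p_i.
    """
    nonempty = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(p_hat, p_model) if a > 0)
    empty = sum(b for a, b in zip(p_hat, p_model) if a == 0)
    return 2.0 * (nonempty + h * empty)


# Hypothetical example: the last observed cell is empty and has model mass 0.1.
p_hat = [0.5, 0.3, 0.2, 0.0]
p_model = [0.4, 0.3, 0.2, 0.1]
d_h1 = penalized_hellinger(p_hat, p_model, h=1.0)    # ordinary Hellinger distance
d_half = penalized_hellinger(p_hat, p_model, h=0.5)  # empty cells down-weighted
print(d_h1, d_half)
```

With $h = 1$ an empty cell contributes $(0 - \sqrt{p_i})^2 = p_i$, so the ordinary Hellinger distance is recovered; with $h = 1/2$ the contribution of the empty cells is halved.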



Hence (3.7) can be written as follows

$$\hat\theta_{PH} = \arg\min_{\theta \in \Theta} PHD_h(\hat P, P_\theta) \qquad (3.9)$$



One of the motivations for using the penalized Hellinger distance is the fact
that this suitable choice may lead to an estimate more robust than the MLE.
A model selection criterion can be designed to estimate an expected overall
discrepancy, a quantity which reflects the degree of similarity between a fitted
approximating model and the generating or true model. Estimation of Kullback's
information (see Kullback-Leibler (1951)) is the key to deriving the
Akaike Information Criterion AIC (Akaike (1974)).

Motivated by the above developments, we propose, by analogy with the approach
introduced by Vuong (1993), a new information criterion relating to the
$\phi$-divergences. In our test, the null hypothesis is that the competing models
are equally close to the data generating process (DGP), where closeness of a model
is measured according to the discrepancy implicit in the penalized Hellinger
divergence.



4 Asymptotic distribution of the penalized Hellinger distance 

Hereafter, we focus on asymptotic results. We assume that the true parameter
$\theta_0$ and the mapping $P : \Theta \to \Omega_m$ satisfy conditions 1-6 of Birch (1964).
We consider the $m$-vector $P_\theta = (p_{1\theta}, \ldots, p_{m\theta})^T$, the
$m \times k$ Jacobian matrix $J_\theta = (J_{ji}(\theta))_{j = 1, \ldots, m;\ i = 1, \ldots, k}$
with $J_{ji}(\theta) = \frac{\partial}{\partial \theta_i} p_{j\theta}$, the $m \times k$
matrix $D_\theta = \mathrm{diag}(P_\theta^{-1/2}) J_\theta$ and the $k \times k$ Fisher
information matrix

$$I_\theta = D_\theta^T D_\theta.$$

The above defined matrices are considered at the point $\theta \in \Theta$ where the
derivatives exist and all the coordinates $p_j(\theta)$ are positive.

The stochastic convergences of random vectors $X_n$ to a random vector $X$ are
(This relation means $\lim_{x \to \infty} \limsup_{n \to \infty} P(\|c_n X_n\| > x) = 0$.)

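An estimator $\hat P$ of $P_{\theta_0}$ is consistent if for every $\theta_0 \in \Theta$
the random vector $(\hat p_1, \ldots, \hat p_m)$ tends in probability to
$(p_{1\theta_0}, \ldots, p_{m\theta_0})$, i.e., if

$$\lim_{n \to \infty} P\left(\|\hat P - P_{\theta_0}\| > \varepsilon\right) = 0 \quad \text{for all } \varepsilon > 0.$$

We need the following result to prove Theorem 4.3.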



Proposition 4.1 (Mandal et al. 2008)

Let $\phi \in \Phi$, let $p : \Theta \to \Omega_m$ be twice continuously differentiable
in a neighborhood of $\theta_0$ and assume that conditions 1-5 of Section 2 hold.
Suppose that $I_{\theta_0}$ is the $k \times k$ Fisher information matrix and that
$\hat\theta_{PH}$ satisfies (3.7). Then the limiting distribution of
$\sqrt{n}(\hat\theta_{PH} - \theta_0)$ as $n \to +\infty$ is $N[0, I_{\theta_0}^{-1}]$.



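Lemma 4.2 We have

$$\sqrt{n}\,(\hat P - P_{\theta_0}) \xrightarrow{L} N(0, \Sigma_P)$$

where $\hat P = (\hat p_1, \ldots, \hat p_m)$ is the estimator of
$P_{\theta_0} = (p_{1\theta_0}, \ldots, p_{m\theta_0})$ defined in (2.1), with
$\Sigma_P = \mathrm{diag}(P_{\theta_0}) - P_{\theta_0} P_{\theta_0}^T$.

proof. Denote

$$V_n = \frac{1}{\sqrt{n}} \left(N_1 - n p_{1\theta_0}, \ldots, N_m - n p_{m\theta_0}\right)^T$$

and $N_j = \sum_{i=1}^{n} T_j(X_i)$, where $T_j(X_i) = 1$ if $X_i = j$ and $0$ otherwise.
Then

$$V_n = \frac{1}{\sqrt{n}} \left(\sum_{i=1}^{n} T_1(X_i) - n p_{1\theta_0}, \ldots, \sum_{i=1}^{n} T_m(X_i) - n p_{m\theta_0}\right)^T$$

and applying the Central Limit Theorem we have

$$\sqrt{n} \left(\frac{N_1 - n p_{1\theta_0}}{n}, \ldots, \frac{N_m - n p_{m\theta_0}}{n}\right)^T \xrightarrow{L} N(0, \Sigma_P)$$

where

$$\Sigma_P = \mathrm{diag}(P_{\theta_0}) - P_{\theta_0} P_{\theta_0}^T. \qquad (4.10)$$
□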
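The covariance matrix in (4.10) can be checked by simulation. The following sketch, written under the assumption of a small three-cell distribution chosen only for illustration, draws repeated multinomial samples and compares the empirical second moments of $\sqrt{n}(\hat P - P_{\theta_0})$ with $\mathrm{diag}(P_{\theta_0}) - P_{\theta_0}P_{\theta_0}^T$. The function names, the probability vector, the sample size, and the number of replications are all hypothetical.

```python
import random


def multinomial_cov(p):
    """Sigma = diag(p) - p p^T, the limiting covariance in (4.10)."""
    m = len(p)
    return [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(m)]
            for a in range(m)]


def simulated_cov(p, n, reps, seed=0):
    """Second-moment matrix of sqrt(n) * (p_hat - p) over `reps` samples of size n."""
    rng = random.Random(seed)
    m = len(p)
    acc = [[0.0] * m for _ in range(m)]
    for _ in range(reps):
        counts = [0] * m
        for x in rng.choices(range(m), weights=p, k=n):
            counts[x] += 1
        v = [(counts[j] - n * p[j]) / n ** 0.5 for j in range(m)]
        for a in range(m):
            for b in range(m):
                acc[a][b] += v[a] * v[b]
    return [[acc[a][b] / reps for b in range(m)] for a in range(m)]


p = [0.2, 0.5, 0.3]  # hypothetical cell probabilities
sigma = multinomial_cov(p)
sim = simulated_cov(p, n=200, reps=2000, seed=1)
max_gap = max(abs(sigma[a][b] - sim[a][b]) for a in range(3) for b in range(3))
print(max_gap)
```

Because $\sqrt{n}(\hat P - P_{\theta_0})$ has mean exactly zero, averaging outer products estimates its covariance directly; the largest entrywise gap should be small for these settings.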



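For simplicity, we write $D_H^h(\hat P, P_{\hat\theta_{PH}})$ instead of
$PHD_h(\hat P, P_{\hat\theta_{PH}})$.

Theorem 4.3 Under the assumptions of Proposition 4.1, we have

$$\sqrt{n}\,(\hat P - P_{\hat\theta_{PH}}) \xrightarrow{L} N(0, \Lambda_{\theta_0})$$

where

$$\Lambda_{\theta_0} = \Sigma_{\theta_0} - \Sigma_{\theta_0} M_{\theta_0}^T - M_{\theta_0} \Sigma_{\theta_0} + M_{\theta_0} \Sigma_{\theta_0} M_{\theta_0}^T,$$

$$M_{\theta_0} = J_{\theta_0} I_{\theta_0}^{-1} D_{\theta_0}^T\, \mathrm{diag}(P_{\theta_0}^{-1/2}). \qquad (4.11)$$

proof. A first order Taylor expansion gives

$$P_{\hat\theta_{PH}} = P_{\theta_0} + J_{\theta_0} (\hat\theta_{PH} - \theta_0)^T + o(\|\hat\theta_{PH} - \theta_0\|). \qquad (4.12)$$

In the same way as in Morales et al. (1995), it can be established that:

$$\hat\theta_{PH} = \theta_0 + I_{\theta_0}^{-1} D_{\theta_0}^T\, \mathrm{diag}(P_{\theta_0}^{-1/2}) (\hat P - P_{\theta_0}) + o(\|\hat P - P_{\theta_0}\|). \qquad (4.13)$$

From (4.12) and (4.13) we obtain

$$P_{\hat\theta_{PH}} = P_{\theta_0} + J_{\theta_0} I_{\theta_0}^{-1} D_{\theta_0}^T\, \mathrm{diag}(P_{\theta_0}^{-1/2}) (\hat P - P_{\theta_0}) + o(\|\hat P - P_{\theta_0}\|),$$

therefore the random vectors

$$\begin{pmatrix} \hat P - P_{\theta_0} \\ P_{\hat\theta_{PH}} - P_{\theta_0} \end{pmatrix}_{2m \times 1} \quad \text{and} \quad \begin{pmatrix} I \\ M_{\theta_0} \end{pmatrix}_{2m \times m} (\hat P - P_{\theta_0})_{m \times 1},$$

where $I$ is the $m \times m$ identity matrix, have the same asymptotic distribution.
Furthermore it is clear (applying the Central Limit Theorem) that

$$\sqrt{n}\,(\hat P - P_{\theta_0}) \xrightarrow{L} N[0, \Sigma_{\theta_0}],$$

$\Sigma_{\theta_0}$ being the $m \times m$ matrix
$\mathrm{diag}[P_{\theta_0}] - P_{\theta_0} P_{\theta_0}^T$, which implies

$$\sqrt{n} \begin{pmatrix} \hat P - P_{\theta_0} \\ P_{\hat\theta_{PH}} - P_{\theta_0} \end{pmatrix}_{2m \times 1} \xrightarrow{L} N\left[0, \begin{pmatrix} I \\ M_{\theta_0} \end{pmatrix} \Sigma_{\theta_0} \begin{pmatrix} I & M_{\theta_0}^T \end{pmatrix}\right],$$

therefore, we get

$$\sqrt{n}\,(\hat P - P_{\hat\theta_{PH}}) = \sqrt{n}\,(\hat P - P_{\theta_0}) + \sqrt{n}\,(P_{\theta_0} - P_{\hat\theta_{PH}}) \xrightarrow{L} N[0, \Lambda(\theta_0)] \qquad (4.14)$$

with

$$\Lambda(\theta_0) = \Sigma_{\theta_0} - \Sigma_{\theta_0} M_{\theta_0}^T - M_{\theta_0} \Sigma_{\theta_0} + M_{\theta_0} \Sigma_{\theta_0} M_{\theta_0}^T.$$

□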
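To make the matrices of Theorem 4.3 tangible, the sketch below evaluates $M_{\theta_0}$ and $\Lambda_{\theta_0}$ for a one-parameter model ($k = 1$), assuming the forms $M = J I^{-1} D^T \mathrm{diag}(P^{-1/2})$ and $\Lambda = \Sigma - \Sigma M^T - M\Sigma + M\Sigma M^T$. The model used, three cells with Binomial(2, $\theta$) probabilities, is a hypothetical choice for illustration and is not taken from the paper; all function names are likewise assumptions.

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def lambda_matrix(p, jac):
    """Return (M, Sigma, I, Lambda) for a one-parameter model with cell probabilities p
    and derivative vector jac = dp/dtheta."""
    m = len(p)
    fisher = sum(jac[i] ** 2 / p[i] for i in range(m))  # I = D^T D, D = diag(p^{-1/2}) J
    M = [[jac[a] * jac[b] / (p[b] * fisher) for b in range(m)] for a in range(m)]
    sigma = [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(m)]
             for a in range(m)]
    s_mt = matmul(sigma, transpose(M))
    m_s = matmul(M, sigma)
    m_s_mt = matmul(m_s, transpose(M))
    lam = [[sigma[a][b] - s_mt[a][b] - m_s[a][b] + m_s_mt[a][b] for b in range(m)]
           for a in range(m)]
    return M, sigma, fisher, lam


theta = 0.3  # hypothetical true parameter
p = [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]
jac = [-2 * (1 - theta), 2 - 4 * theta, 2 * theta]
M, sigma, fisher, lam = lambda_matrix(p, jac)
print(fisher)
print(lam)
```

In this example $M J = J$, and since the derivatives of the cell probabilities sum to zero, $\Lambda$ simplifies to $\Sigma - J I^{-1} J^T$: fitting the parameter removes the variability of $\hat P$ along the model direction.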
The case of interest to us here is the test of the hypothesis
$H_0 : P \in \mathcal{P}$. Our proposal is based on the penalized divergence test
statistic $D_H^h(\hat P, P_{\hat\theta_{PH}})$, where $\hat P$ and $\hat\theta_{PH}$
have been introduced in Theorem 4.3 and (3.7) respectively.

Using arguments similar to those developed by Basu (1996), under the assumptions
of Theorem 4.3 and the hypothesis $H_0 : P = P_\theta$, the asymptotic distribution
of $2n D_H^h(\hat P, P_{\hat\theta_{PH}})$ is a chi-square with $m - k - 1$ degrees of
freedom when $h = 1$. Since the other members of the family of penalized Hellinger
distance tests differ from the ordinary Hellinger distance test only at the empty
cells, they too have the same asymptotic distribution (see Simpson 1989 and Basu,
Harris and Basu 1996, among others).

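Consider now the case when the model is wrong, i.e., $H_1 : P \neq P_\theta$. We
introduce the following regularity assumptions:

(A1) There exists $\theta_1 = \arg\inf_{\theta \in \Theta} PHD_h(P, P_\theta)$ such that

$$P_{\hat\theta_{PH}} \to P_{\theta_1} \quad \text{when } n \to +\infty.$$

(A2) There exists

$$\Lambda^* = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix},$$

with $\Lambda_{12} = \Lambda_{21}$ and $\Lambda_{11} = \Sigma_P$ in (4.10), such that

$$\sqrt{n} \begin{pmatrix} \hat P - P \\ P_{\hat\theta_{PH}} - P_{\theta_1} \end{pmatrix} \xrightarrow{L} N[0, \Lambda^*].$$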



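Theorem 4.4 Under $H_1 : P \neq P_\theta$, and assuming that conditions (A1) and (A2)
hold, we have:

$$\sqrt{n}\left(D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(P, P_{\theta_1})\right) \xrightarrow{L} N\left(0, \Gamma^2_{(\theta,P)}\right)$$

where

$$\Gamma^2_{(\theta,P)} = H^T \Lambda_{11} H + H^T \Lambda_{12} J + J^T \Lambda_{21} H + J^T \Lambda_{22} J, \qquad (4.15)$$

$$H^T = (h_1, \ldots, h_m) \ \text{with} \ h_i = \left(\frac{\partial}{\partial p_i^1} D_H^h(P^1, P^2)\right)_{P^1 = P,\ P^2 = P_{\theta_1}}, \quad i = 1, \ldots, m,$$

and

$$J^T = (j_1, \ldots, j_m) \ \text{with} \ j_i = \left(\frac{\partial}{\partial p_i^2} D_H^h(P^1, P^2)\right)_{P^1 = P,\ P^2 = P_{\theta_1}}, \quad i = 1, \ldots, m.$$

proof. A first order Taylor expansion gives

$$D_H^h(\hat P, P_{\hat\theta_{PH}}) = D_H^h(P, P_{\theta_1}) + H^T(\hat P - P) + J^T(P_{\hat\theta_{PH}} - P_{\theta_1}) + o\left(\|\hat P - P\| + \|P_{\hat\theta_{PH}} - P_{\theta_1}\|\right). \qquad (4.16)$$

From assumptions (A1) and (A2), the result follows. □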



5 Applications for testing hypotheses

The estimate $D_H^h(\hat P, P_{\hat\theta_{PH}})$ can be used to perform statistical tests.

5.1 Test of goodness-of-fit

For completeness, we look at $D_H^h(\hat P, P_{\hat\theta_{PH}})$ in the usual way,
i.e., as a goodness-of-fit statistic. Recall that here $\hat\theta_{PH}$ is the
minimum penalized Hellinger distance estimator of $\theta$. Since
$D_H^h(\hat P, P_{\hat\theta_{PH}})$ is a consistent estimator of $D_H^h(P, P_\theta)$,
the null hypothesis when using the statistic $D_H^h(\hat P, P_{\hat\theta_{PH}})$ is

$H_0 : D_H^h(P, P_\theta) = 0$ or equivalently, $H_0 : P = P_\theta$.

Hence, if $H_0$ is rejected, one can infer that the parametric model $P_\theta$ is
misspecified. Since $D_H^h(P, P_\theta)$ is non-negative and takes the value zero only
when $P = P_\theta$, the tests are defined through the critical region

$$C_{\hat\theta_{PH}} = \left\{ 2n D_H^h(\hat P, P_{\hat\theta_{PH}}) > q_{\alpha,k} \right\}$$

where $q_{\alpha,k}$ is the $(1 - \alpha)$-quantile of the $\chi^2$ distribution with
$m - k - 1$ degrees of freedom.

Remark 5.1 Theorem 4.4 can be used to give the following approximation
to the power of the test of $H_0 : D_H^h(P, P_\theta) = 0$.

The approximated power function is

$$\beta_n = P\left[2n D_H^h(\hat P, P_{\hat\theta_{PH}}) > q_{\alpha,k}\right] \approx 1 - \mathcal{F}_n\left(\frac{\sqrt{n}}{\Gamma_{(\theta,P)}}\left(\frac{q_{\alpha,k}}{2n} - D_H^h(P, P_\theta)\right)\right) \qquad (5.17)$$

where $q_{\alpha,k}$ is the $(1 - \alpha)$-quantile of the $\chi^2$ distribution with
$m - k - 1$ degrees of freedom and $\mathcal{F}_n$ is a sequence of distribution
functions tending uniformly to the standard normal distribution function $\Phi(x)$.
Note that if $H_1 : D_H^h(P, P_\theta) \neq 0$ holds, then for any fixed size $\alpha$
the probability of rejecting $H_0 : D_H^h(P, P_\theta) = 0$ with the rejection rule
$2n D_H^h(\hat P, P_{\hat\theta_{PH}}) > q_{\alpha,k}$ tends to one as $n \to \infty$.

Obtaining the approximate sample size $n$ guaranteeing a power $\beta$ for a given
alternative $P$ is an interesting application of formula (5.17). If we wish the
power to be equal to $\beta^*$, we must solve the equation

$$\beta^* = 1 - \mathcal{F}_n\left(\frac{\sqrt{n}}{\Gamma_{(\theta,P)}}\left(\frac{q_{\alpha,k}}{2n} - D_H^h(P, P_\theta)\right)\right).$$

It is not difficult to check that the sample size $n^*$ is the solution of the
following equation

$$n^2 D_H^h(P, P_\theta)^2 - n D_H^h(P, P_\theta)\, q_{\alpha,k} + \frac{q_{\alpha,k}^2}{4} - n\, \Gamma^2_{(\theta,P)} \left[\mathcal{F}_n^{-1}(1 - \beta^*)\right]^2 = 0.$$

The solution is given by

$$n^* = \frac{(a + b) - \sqrt{a(a + 2b)}}{2 D_H^h(P, P_\theta)^2}$$

with $a = \Gamma^2_{(\theta,P)} \left[\mathcal{F}_n^{-1}(1 - \beta^*)\right]^2$ and
$b = q_{\alpha,k} D_H^h(P, P_\theta)$, and the required size is
$[n^*] + 1$, where $[\cdot]$ denotes "integer part of".



5.2 Test for model selection

As we mentioned above, when one chooses a particular $\phi$-divergence type
statistic $D_H^h(\hat P, P_{\hat\theta_{PH}}) = PHD_h(\hat P, P_{\hat\theta_{PH}})$, with
$\hat\theta_{PH}$ the corresponding minimum penalized Hellinger distance estimator of
$\theta$, one actually evaluates the goodness-of-fit of the parametric model
$P_\theta$ according to the discrepancy $D_H^h(P, P_\theta)$ between the true
distribution $P$ and the specified model $P_\theta$. Thus it is natural
to define the best model among a collection of competing models to be the
model that is closest to the true distribution according to the discrepancy
$D_H^h(P, P_\theta)$.

In this paper we consider the problem of selecting between two models. Let
$G_\mu = \{G(\cdot \mid \mu) ; \mu \in T\}$ be another model, where $T$ is a
$q$-dimensional parametric space in $\mathbb{R}^q$. In a similar way, we can define
the minimum penalized Hellinger distance estimator of $\mu$ and the corresponding
discrepancy $D_H^h(P, G_\mu)$ for the model $G_\mu$.

Our special interest is the situation in which a researcher has two competing
parametric models $P_\theta$ and $G_\mu$, and wishes to select the better of the two
models based on their discrimination statistics between the observations and the
models $P_\theta$ and $G_\mu$, defined respectively by
$D_H^h(\hat P, P_{\hat\theta_{PH}})$ and $D_H^h(\hat P, G_{\hat\mu_{PH}})$.

Let $P_\theta$ and $G_\mu$ be the two competing parametric models, with the given
discrepancy $D_H^h(P, \cdot)$.

Definition 5.2

$H_0^{eq} : D_H^h(P, P_\theta) = D_H^h(P, G_\mu)$ means that the two models are equivalent,

$H_{P_\theta} : D_H^h(P, P_\theta) < D_H^h(P, G_\mu)$ means that $P_\theta$ is better than $G_\mu$,

$H_{G_\mu} : D_H^h(P, P_\theta) > D_H^h(P, G_\mu)$ means that $P_\theta$ is worse than $G_\mu$.



Remark 5.3 1) This definition does not require that the same divergence type
statistic be used in forming $D_H^h(\hat P, P_{\hat\theta_{PH}})$ and
$D_H^h(\hat P, G_{\hat\mu_{PH}})$. Choosing different discrepancies for evaluating
competing models is, however, hardly justified.

2) This definition does not require that either of the competing models be
correctly specified. On the other hand, a correctly specified model must be at least
as good as any other model.

The indicator $D_H^h(P, P_\theta) - D_H^h(P, G_\mu)$ is unknown,
but from the previous section, it can be estimated by the difference

$$D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}}).$$

This difference converges to zero under the null hypothesis $H_0^{eq}$, but converges
to a strictly negative or positive constant when $H_{P_\theta}$ or $H_{G_\mu}$ holds.
These properties justify the use of
$D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}})$ as a
model selection indicator, and the common procedure of selecting the model with
the highest goodness-of-fit.

As argued in the introduction, however, it is important to take into account
the random nature of the difference
$D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}})$ so as to
assess its significance. To do so we consider the asymptotic distribution of
$\sqrt{n}\left[D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}})\right]$
under $H_0^{eq}$.

Our major task is to propose some tests for model selection, i.e., for the null
hypothesis $H_0^{eq}$ against the alternatives $H_{P_\theta}$ or $H_{G_\mu}$. We use
the next lemma with $\hat\theta_{PH}$ and $\hat\mu_{PH}$ as the corresponding minimum
penalized Hellinger distance estimators of $\theta$ and $\mu$.

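Using $P$ and $P_\theta$ defined earlier, we consider the vectors

$$K_\theta^T = (k_1, \ldots, k_m) \ \text{where} \ k_i = \left(\frac{\partial}{\partial p_i^1} D_H^h(P^1, P^2)\right)_{P^1 = P,\ P^2 = P_\theta} \ \text{with} \ i = 1, \ldots, m,$$

$$Q_\theta^T = (q_1, \ldots, q_m) \ \text{where} \ q_i = \left(\frac{\partial}{\partial p_i^2} D_H^h(P^1, P^2)\right)_{P^1 = P,\ P^2 = P_\theta} \ \text{with} \ i = 1, \ldots, m.$$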



Lemma 5.4 Under the assumptions of Theorem 4.4, we have
(i) for the model $P_\theta$,

$$D_H^h(\hat P, P_{\hat\theta_{PH}}) = D_H^h(P, P_\theta) + K_\theta^T(\hat P - P) + Q_\theta^T(P_{\hat\theta_{PH}} - P_\theta) + o_P(1)$$

(ii) for the model $G_\mu$,

$$D_H^h(\hat P, G_{\hat\mu_{PH}}) = D_H^h(P, G_\mu) + K_\mu^T(\hat P - P) + Q_\mu^T(G_{\hat\mu_{PH}} - G_\mu) + o_P(1)$$

proof.

The results follow from a first order Taylor expansion. □
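We define

$$\Gamma^2 = (K_\theta - K_\mu,\ Q_\theta - Q_\mu)^T\, \Lambda^*\, (K_\theta - K_\mu,\ Q_\theta - Q_\mu)$$

which is the variance of
$(K_\theta - K_\mu,\ Q_\theta - Q_\mu)^T \begin{pmatrix} \hat P - P \\ P_{\hat\theta_{PH}} - P_\theta \end{pmatrix}$.
Since $K_\theta$, $K_\mu$, $Q_\theta$, $Q_\mu$ and $\Lambda^*$ are consistently
estimated by their sample analogues $K_{\hat\theta}$, $K_{\hat\mu}$, $Q_{\hat\theta}$,
$Q_{\hat\mu}$ and $\hat\Lambda^*$, $\Gamma^2$ is consistently estimated by

$$\hat\Gamma^2 = (K_{\hat\theta} - K_{\hat\mu},\ Q_{\hat\theta} - Q_{\hat\mu})^T\, \hat\Lambda^*\, (K_{\hat\theta} - K_{\hat\mu},\ Q_{\hat\theta} - Q_{\hat\mu}).$$

Next we define the model selection statistic and give its asymptotic distribution
under the null and alternative hypotheses.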

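Let

$$\mathcal{HI}^h = \frac{\sqrt{n}}{\hat\Gamma}\left\{D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}})\right\}$$

where $\mathcal{HI}^h$ stands for the penalized Hellinger Indicator.

The following theorem provides the limit distribution of $\mathcal{HI}^h$ under the
null and alternative hypotheses.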



Theorem 5.5 Under the assumptions of Theorem 4.4, suppose that
$\Gamma \neq 0$. Then:

(i) Under the null hypothesis $H_0^{eq}$, $\mathcal{HI}^h \xrightarrow{L} N(0, 1)$

(ii) Under the alternative hypothesis $H_{P_\theta}$, $\mathcal{HI}^h \to -\infty$ in probability

(iii) Under the alternative hypothesis $H_{G_\mu}$, $\mathcal{HI}^h \to +\infty$ in probability

proof.

From Lemma 5.4, it follows that

$$D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}}) = D_H^h(P, P_\theta) - D_H^h(P, G_\mu) + K_\theta^T(\hat P - P) - K_\mu^T(\hat P - P) + Q_\theta^T(P_{\hat\theta_{PH}} - P_\theta) - Q_\mu^T(G_{\hat\mu_{PH}} - G_\mu) + o_P(1)$$

Under $H_0^{eq}$: $P_\theta = G_\mu$ and $P_{\hat\theta_{PH}} = G_{\hat\mu_{PH}}$, and we get:

$$D_H^h(\hat P, P_{\hat\theta_{PH}}) - D_H^h(\hat P, G_{\hat\mu_{PH}}) = K_\theta^T(\hat P - P) - K_\mu^T(\hat P - P) + Q_\theta^T(P_{\hat\theta_{PH}} - P_\theta) - Q_\mu^T(P_{\hat\theta_{PH}} - P_\theta) + o_P(1)$$

$$= (K_\theta - K_\mu,\ Q_\theta - Q_\mu)^T \begin{pmatrix} \hat P - P \\ P_{\hat\theta_{PH}} - P_\theta \end{pmatrix} + o_P(1)$$

Finally, applying the Central Limit Theorem and assumptions (A1)-(A2), we
immediately obtain $\mathcal{HI}^h \xrightarrow{L} N(0, 1)$. □
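Theorem 5.5 suggests a simple decision rule, sketched below under two explicit assumptions: that the indicator is the studentized difference $\sqrt{n}\,(D_1 - D_2)/\hat\Gamma$ of the two fitted discrepancies, and that the test is two-sided with a standard normal critical value. The function names, argument names, and numerical inputs are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def hellinger_indicator(n, d_first, d_second, gamma_hat):
    """Assumed form of the indicator: sqrt(n) * (d_first - d_second) / gamma_hat."""
    if gamma_hat <= 0:
        raise ValueError("gamma_hat must be positive")
    return sqrt(n) * (d_first - d_second) / gamma_hat


def select_model(hi_stat, alpha=0.05):
    """Two-sided rule: a large negative value favours the first model,
    a large positive value favours the second, anything else is indecisive."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    if hi_stat < -z:
        return "first model"
    if hi_stat > z:
        return "second model"
    return "indecisive"


# Hypothetical inputs: fitted discrepancies of the two models and a variance estimate.
hi = hellinger_indicator(n=50, d_first=0.034, d_second=0.236, gamma_hat=0.3)
print(hi, select_model(hi))
```

A smaller fitted discrepancy for the first model makes the indicator negative, so rejection in the lower tail selects the first model, in line with part (ii) of the theorem.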



6 Computational results

6.1 Example

To illustrate the model selection procedure discussed in the preceding section, we
consider an example. We need to define the competing models, the estimation
method used for each competing model and the penalized Hellinger type statistic
used to measure the departure of each proposed parametric model from the true
data generating process.

For our competing models, we consider the problem of choosing between the
family of Poisson distributions and the family of geometric distributions. The
Poisson distribution $P(\lambda)$ is parameterized by $\lambda$ and has density

$$f(x, \lambda) = \frac{\exp(-\lambda)\, \lambda^x}{x!} \quad \text{for } x \in \mathbb{N} \text{ and zero otherwise.}$$

The geometric distribution $G(p)$ is parameterized by $p$ and has density

$$g(x, p) = (1 - p)^{x - 1} \times p \quad \text{for } x \in \mathbb{N}^* \text{ and zero otherwise.}$$

We use the minimum penalized Hellinger distance statistic to evaluate the
discrepancy of the proposed model from the true data generating process.
We partition the real line into $m$ intervals $\{[C_{i-1}, C_i[,\ i = 1, \ldots, m\}$
where $C_0 = 0$ and $C_m = +\infty$. The choice of the cells is discussed below.
The corresponding minimum penalized Hellinger distance estimators of $\lambda$ and
$p$ are:

$$\hat\lambda_{PH} = \arg\min_{\lambda} D_H^h(\hat P, P_\lambda) = \arg\min_{\lambda} \left[\sum_{i \in \omega} \left(\hat p_i^{1/2} - p_{i\lambda}^{1/2}\right)^2 + h \sum_{i \in \omega^c} p_{i\lambda}\right]$$

$$\hat p_{PH} = \arg\min_{p} D_H^h(\hat P, G_p) = \arg\min_{p} \left[\sum_{i \in \omega} \left(\hat p_i^{1/2} - p_{ip}^{1/2}\right)^2 + h \sum_{i \in \omega^c} p_{ip}\right]$$

where $p_{i\lambda}$ and $p_{ip}$ are the probabilities of the cells $[C_{i-1}, C_i[$
under the Poisson and geometric distributions respectively.

Figure 1 : Histogram of DGP=Pois(4) with n=50

Figure 2 : Comparative barplot of $\mathcal{HI}^h$ depending on n
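The estimation step can be sketched as follows, assuming the eight cells $[0,1[, \ldots, [6,7[, [7,+\infty[$ used in the experiments and replacing whatever optimizer the authors used by a plain grid search. The helper names, the grids, and the use of exact cell probabilities as "observed" frequencies are assumptions made for this illustration.

```python
import math


def poisson_cells(lam, m=8):
    """Cell probabilities of [i-1, i[ (i = 1..m-1) and [m-1, +inf[ under Pois(lam)."""
    probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(m - 1)]
    return probs + [1.0 - sum(probs)]


def geometric_cells(p, m=8):
    """Same cells under g(x, p) = (1 - p)^(x - 1) * p on {1, 2, ...}; cell [0, 1[ has mass 0."""
    probs = [0.0] + [(1.0 - p) ** (x - 1) * p for x in range(1, m - 1)]
    return probs + [1.0 - sum(probs)]


def penalized_hellinger(p_hat, p_model, h=1.0):
    """2 * [sum over non-empty cells of squared root differences + h * mass of empty cells]."""
    nonempty = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(p_hat, p_model) if a > 0)
    empty = sum(b for a, b in zip(p_hat, p_model) if a == 0)
    return 2.0 * (nonempty + h * empty)


def min_phd_estimate(p_hat, cell_probs, grid, h=1.0):
    """Grid-search minimizer of the penalized Hellinger distance (illustrative only)."""
    return min(grid, key=lambda t: penalized_hellinger(p_hat, cell_probs(t), h))


# Sanity check with "observed" frequencies equal to exact model cell probabilities.
lam_grid = [i / 100 for i in range(50, 1001)]   # 0.50, 0.51, ..., 10.00
p_grid = [i / 1000 for i in range(1, 1000)]     # 0.001, ..., 0.999
lam_hat = min_phd_estimate(poisson_cells(4.0), poisson_cells, lam_grid, h=0.5)
p_est = min_phd_estimate(geometric_cells(0.2), geometric_cells, p_grid, h=0.5)
print(lam_hat, p_est)
```

When the observed frequencies coincide with the model's cell probabilities the distance is exactly zero at the generating parameter, so the grid search returns it; with real data the minimizer is the penalized estimate over the grid, and a finer grid or a numerical optimizer could be substituted.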

We consider various sets of experiments in which data are generated from a
mixture of a Poisson and a geometric distribution. These two distributions are
calibrated so that their two means are close (4 and 5 respectively). Hence the
DGP (Data Generating Process) is generated from $M(\pi)$ with the density

$$m(\pi) = \pi\, \mathrm{Pois}(4) + (1 - \pi)\, \mathrm{Geom}(0.2)$$

where $\pi$ ($\pi \in [0, 1]$) is a value specific to each set of experiments. In
each set of experiments several random samples are drawn from this mixture of
distributions. The sample size varies from 20 to 300, and for each sample size the
number of replications is 1000. In each set of experiments, we choose two values
of the parameter, $h = 1$ and $h = 1/2$, where $h = 1$ corresponds to the
classic Hellinger distance. The aim is to compare the accuracy of the model
selection depending on the parameter setting chosen. To obtain a good fit with
the proposed method, for the chosen parameters of these two distributions, we
note that most of the mass is concentrated between 0 and 10. Therefore, the
chosen partition has eight cells defined by
$\{[C_{i-1}, C_i[ = [i - 1, i[,\ i = 1, \ldots, 7\}$, and
$[C_7, C_8[ = [7, +\infty[$ represents the last cell. We choose different values of
$\pi$, which are 0.00, 0.25, 0.535, 0.75, 1.00. Although our proposed model selection
procedure does not require that the data generating process belong to either
of the competing models, we consider the two limiting cases $\pi = 1.00$ and
$\pi = 0.00$ because they correspond to the correctly specified cases. To investigate
the case where both competing models are misspecified but not at equal distance
from the DGP, we consider the cases $\pi = 0.25$, $\pi = 0.75$ and $\pi = 0.535$.
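A data generating process of this kind can be simulated with the standard library alone. The sketch below draws from $\pi\,\mathrm{Pois}(4) + (1 - \pi)\,\mathrm{Geom}(0.2)$ using Knuth's multiplication method for the Poisson part and inverse-transform sampling for the geometric part; these samplers, the function names, and the seeds are choices made for this illustration and are not necessarily those used by the authors.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for a small mean such as lam = 4."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_geometric(p, rng):
    """Inverse-transform sampler on {1, 2, ...} with P(X = x) = (1 - p)^(x - 1) * p."""
    u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is finite
    return int(math.log(u) / math.log(1.0 - p)) + 1


def sample_mixture(pi, n, lam=4.0, p=0.2, seed=0):
    """Draw n observations: Poisson with probability pi, geometric otherwise."""
    rng = random.Random(seed)
    return [sample_poisson(lam, rng) if rng.random() < pi else sample_geometric(p, rng)
            for _ in range(n)]


pure_poisson = sample_mixture(1.0, 20000, seed=1)
pure_geometric = sample_mixture(0.0, 20000, seed=2)
contaminated = sample_mixture(0.75, 50, seed=3)
print(sum(pure_poisson) / len(pure_poisson), sum(pure_geometric) / len(pure_geometric))
```

The two limiting cases $\pi = 1$ and $\pi = 0$ reproduce the component distributions, whose sample means should be close to 4 and 5 respectively.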
n                        20            30            40            50            300
$\hat p_{PH}$            0.210(0.03)   0.195(0.03)   0.197(0.02)   0.205(0.02)   0.201(0.01)
$\hat\lambda_{PH}$       3.950(0.46)   4.090(0.4)    4.015(0.31)   4.015(0.28)   4.011(0.13)
DHP(Pois)   h = 1        [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
            h = 1/2      0.096(0.04)   0.064(0.03)   0.048(0.02)   0.034(0.02)   0.03(0.01)
DHP(Geom)   h = 1        0.391(0.28)   0.348(0.12)   0.298(0.09)   0.282(0.10)   0.271(0.05)
            h = 1/2      0.278(0.07)   0.262(0.08)   0.242(0.06)   0.236(0.06)   0.231(0.03)
HI^h        h = 1/2      -3.67(2.14)   -4.32(2.69)   -4.34(2.38)   -4.83(2.52)   -4.97(2.18)
            Correct      77%           87%           92%           96%           100%
            Indecisive   23%           13%           08%           04%           00%
            Incorrect    00%           00%           00%           00%           00%
HI^h        h = 1        -3.61(3.03)   -3.98(2.48)   -3.73(2.29)   -4.16(2.35)   -4.25(1.87)
            Correct      70%           79%           83%           86%           93%
            Indecisive   30%           21%           17%           14%           07%
            Incorrect    00%           00%           00%           00%           00%

Table 1 : DGP=Pois(4)





Figure 3 : Histogram of DGP=Geom(0.2) with n=50

Figure 4 : Comparative barplot of $\mathcal{HI}^h$ depending on n



The case $\pi = 0.75$ corresponds to a DGP which is Poisson but slightly
contaminated by a geometric distribution. The case $\pi = 0.25$ is interpreted
similarly as a geometric slightly contaminated by a Poisson distribution. In the
last case, $\pi = 0.535$ is the value for which the Poisson
$D_H^h(P, P_\lambda)$ and the geometric $D_H^h(P, G_p)$ families are approximately
at equal distance from the mixture $m(\pi)$ according to the penalized Hellinger
distance with the above cells. Thus this set of experiments corresponds
approximately to the null hypothesis of our proposed model selection test
$\mathcal{HI}^h$. The results of our different sets of experiments are presented in
Tables 1-5. The first half of each table gives the average values of the minimum
penalized Hellinger distance estimators $\hat\lambda_{PH}$ and $\hat p_{PH}$, the
penalized Hellinger goodness-of-fit statistics
$D_H^h(\hat P, P_{\hat\lambda_{PH}})$ and $D_H^h(\hat P, G_{\hat p_{PH}})$, and the
Hellinger indicator statistic $\mathcal{HI}^h$. The values in parentheses are
standard errors. The second half of each table gives, in percentages, the number
of times our proposed model selection procedure based on $\mathcal{HI}^h$
favors the Poisson model, favors the geometric model, or is indecisive. The tests
are conducted at the 5% nominal significance level.
n                        20            30            40            50            300
$\hat p_{PH}$            0.196(0.04)   0.213(0.03)   0.203(0.02)   0.203(0.02)   0.201(0.01)
$\hat\lambda_{PH}$       3.920(1.0)    4.206(0.89)   4.021(0.67)   4.109(0.58)   4.03(0.34)
DHP(Pois)   h = 1        [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
            h = 1/2      0.281(0.1)    0.273(0.07)   0.254(0.07)   0.246(0.07)   0.237(0.02)
DHP(Geom)   h = 1        0.150(0.06)   0.089(0.05)   0.053(0.03)   0.039(0.02)   0.033(0.01)
            h = 1/2      0.103(0.04)   0.067(0.03)   0.044(0.02)   0.035(0.02)   0.027(0.98)
HI^h        h = 1/2      1.880(1.43)   2.560(1.37)   3.020(1.25)   3.340(1.14)   3.40(1.03)
            Correct      42%           72%           81%           90%           97%
            Indecisive   58%           28%           19%           10%           03%
            Incorrect    00%           00%           00%           00%           00%
HI^h        h = 1        1.710(1.07)   2.260(1.05)   2.760(0.96)   3.01(0.65)    4.19(0.32)
            Correct      36%           62%           77%           84%           92%
            Indecisive   64%           38%           23%           16%           08%
            Incorrect    00%           00%           00%           00%           00%

Table 2 : DGP=Geom(0.2)



Figure 5 : Histogram of DGP=0.75xGeom+0.25xPois with n=50

Figure 6 : Comparative barplot of $HI^h$ depending on n



In the first two sets of experiments ($\pi = 0.00$ and $\pi = 1.00$), where one
model is correctly specified, we use the labels 'correct', 'incorrect' and
'indecisive' when a choice is made. The first halves of tables 1-5 confirm our
asymptotic results. They all show that the minimum penalized Hellinger
estimators $\hat\lambda_{PH}$ and $\hat p_{PH}$ converge to their pseudo-true
values in the misspecified cases and to their true values in the correctly
specified cases as the sample size increases.

As for our $HI^h$ statistic, it diverges to $-\infty$ or $+\infty$ at the
approximate rate of $\sqrt{n}$, except in table 5. In the latter case the
$HI^h$ statistic converges, as expected, to zero, which is the mean of the
asymptotic $N(0,1)$ distribution under our null hypothesis of equivalence.
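
Under the null hypothesis of equivalence, the N(0,1) limit also fixes the
nominal frequency of each decision at the 5% level. The short sketch below
computes these frequencies, assuming a two-sided rule with critical value 1.96;
the function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def null_outcome_probabilities(crit=1.96):
    """Nominal decision probabilities when the statistic is exactly N(0,1)."""
    lower = std_normal_cdf(-crit)        # statistic below -crit
    upper = 1.0 - std_normal_cdf(crit)   # statistic above +crit
    return {"lower": lower, "indecisive": 1.0 - lower - upper, "upper": upper}
```

Each tail has probability of about 0.025, so roughly 95% of replications are
expected to be indecisive under exact equivalence, a figure that may be
compared with the 'Indecisive' rows of table 5.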

With the exception of tables 1 and 2, we observed a large percentage of
incorrect decisions. This is because both models are now incorrectly specified.
In contrast, turning to the second halves of tables 1-2, we first note that
the percentage of correct choices using the $HI^h$ statistic steadily increases
and ultimately converges to 100%.
n                     20            30            40            50            300
p                     0.213(0.13)   0.197(0.12)   0.208(0.08)   0.202(0.05)   0.202(0.01)
λ                     4.160(0.72)   3.910(0.55)   4.180(0.55)   3.970(0.43)   4.022(0.21)
DHP(Pois), h = 1      0.?4?(0.1?)   0.4?2(0.1)    0.412(0.0?)   0.402(0.0?)   0.30?(0.0?)
DHP(Pois), h = 1/2    0.344(0.07)   0.340(0.05)   0.320(0.05)   0.311(0.05)   0.304(0.03)
DHP(Geom), h = 1      0.150(0.06)   0.089(0.05)   0.053(0.03)   0.039(0.02)   0.021(0.01)
DHP(Geom), h = 1/2    -3.67(2.62)   -4.32(2.53)   -4.34(2.47)   -4.83(2.27)   -5.37(2.01)
HI^h, h = 1/2         1.220(1.02)   1.820(0.89)   2.080(1.12)   2.370(0.99)   3.102(0.84)
  Geom                23%           40%           50%           64%           81%
  Indecisive          77%           60%           50%           36%           19%
  Pois                00%           00%           00%           00%           00%
HI^h, h = 1           0.840(1.29)   0.831(1.27)   0.845(1.16)   0.967(1.05)   1.131(0.78)
  Geom                17%           15%           19%           22%           33%
  Indecisive          80%           83%           89%           77%           66%
  Pois                03%           02%           02%           01%           01%

(? marks a digit that is illegible in the source.)

Table 3 : DGP=0.75x Geom(0.2)+0.25xPois(4)





Figure 7 : Histogram of DGP=0.25xGeom+0.75xPois with n=50

Figure 8 : Comparative barplot of $HI^h$ depending on n



The preceding comments for the second halves of tables 1 and 2 also apply
to the second halves of tables 3 and 4. In all four tables (1, 2, 3 and 4),
the results confirm that, in small samples, the model selection procedure
based on the penalized Hellinger test statistic (h = 1/2) dominates the one
based on the classical Hellinger test statistic (h = 1) in terms of the
percentage of correct decisions. Table 5 also confirms our asymptotic results:
as the sample size increases, the percentage of rejection of both models
converges, as it should, to 100%.
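
To make the role of h concrete, the sketch below evaluates one common form of
the penalized Hellinger distance between observed cell proportions d and model
cell probabilities p, in which empty cells contribute h times their model
probability. This form and its factor of 2 are assumptions made for
illustration; the definition and normalization used in the earlier sections
take precedence. With h = 1 it reduces to the ordinary Hellinger distance.

```python
import math


def penalized_hellinger(d, p, h=0.5):
    """2 * [sum over non-empty cells of (sqrt(d_i) - sqrt(p_i))^2
            + h * sum over empty cells of p_i]  (assumed form)."""
    if len(d) != len(p):
        raise ValueError("d and p must have the same number of cells")
    nonempty = sum(
        (math.sqrt(di) - math.sqrt(pi)) ** 2 for di, pi in zip(d, p) if di > 0
    )
    empty = sum(pi for di, pi in zip(d, p) if di == 0)
    return 2.0 * (nonempty + h * empty)
```

For example, with d = [0.5, 0.5, 0.0] and p = [0.25, 0.25, 0.5], the choice
h = 1/2 gives a smaller distance than h = 1 because the empty third cell is
penalized only half as heavily.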

In figures 1, 3, 5, 7 and 9 we plot the histograms of the datasets and overlay
the curves for the geometric and Poisson distributions. When the DGP is
correctly specified (figure 1), the Poisson distribution has a reasonable
chance of being distinguished from the geometric distribution.

Similarly, in figure 3, the geometric distribution closely approximates the
data sets. In figures 5 and 7 the two distributions are close, but the
geometric distribution (figure 5) and the Poisson distribution (figure 7) appear
n                     20            30            40            50            300
p                     0.213(0.03)   0.212(0.03)   0.210(0.02)   0.206(0.02)   0.203(0.01)
λ                     4.110(0.43)   4.090(0.31)   3.970(0.28)   4.020(0.26)   4.019(0.17)
DHP(Pois), h = 1      [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
DHP(Pois), h = 1/2    1.443(0.24)   1.473(0.21)   1.520(0.20)   1.500(0.18)   1.483(0.14)
DHP(Geom), h = 1      2.055(0.35)   1.870(0.25)   1.860(0.21)   1.790(0.19)   1.704(0.11)
DHP(Geom), h = 1/2    1.640(0.15)   1.660(0.15)   1.700(0.14)   1.690(0.13)   1.632(0.10)
HI^h, h = 1/2         -2.40(1.27)   -2.44(1.1)    -2.49(1.08)   -2.77(1.01)   -2.89(0.92)
  Geom                00%           00%           00%           00%           00%
  Indecisive          38%           37%           32%           27%           21%
  Pois                62%           63%           68%           83%           79%
HI^h, h = 1           -2.18(1.37)   -2.37(1.33)   -2.31(1.36)   -2.66(1.18)   -2.83(1.06)
  Geom                00%           00%           00%           00%           00%
  Indecisive          48%           45%           46%           30%           24%
  Pois                52%           55%           54%           70%           76%

Table 4 : DGP=0.75x Pois(4)+0.25x Geom(0.2)




Figure 9 : Histogram of DGP=0.465xGeom+0.535xPois with n=50

Figure 10 : Comparative barplot of $HI^h$ depending on n



to be much closer to the data sets. When $\pi = 0.535$, the Poisson and
geometric distributions (figure 9) are similar, while being slightly
symmetrical about the axis that passes through the mode of the data
distribution. This follows from the fact that these two distributions are
equidistant from the DGP and would be difficult to distinguish from the data
in practice.



The preceding results in the tables and theorem (5.5) confirm, in figures 2, 4,
6 and 8, that the Hellinger indicator for the model selection procedure based
on the penalized Hellinger divergence statistic with h = 0.5 (light bars)
dominates the procedure obtained with h = 1 (dark bars), which corresponds to
the ordinary Hellinger distance. As expected, our divergence statistic $HI^h$
diverges to $-\infty$ (figures 2 and 8) and to $+\infty$ (figures 4 and 6) more
rapidly when we use the penalized Hellinger distance test than when we use the
classical Hellinger distance test. Figure 10 allows a comparison with the
asymptotic $N(0,1)$ approximation under our null hypothesis of equivalence.
Hence the indicator $HI^{1/2}$,
n                     20            30            40            50            300
p                     0.196(0.06)   0.204(0.05)   0.211(0.03)   0.213(0.207)  0.204(0.01)
λ                     3.968(0.61)   3.962(0.46)   3.981(0.374)  4.023(0.309)  4.011(0.11)
DHP(Pois), h = 1      [illegible]   2.600(0.46)   ?.589(0.36?)  [illegible]   ?.311(0.?5)
DHP(Pois), h = 1/2    2.633(0.30)   2.492(0.28)   2.369(0.27)   2.302(0.26)   2.142(0.17)
DHP(Geom), h = 1      2.867(0.52)   2.682(0.37)   2.553(0.30)   2.495(0.20)   2.237(0.12)
DHP(Geom), h = 1/2    2.157(0.21)   2.200(0.20)   2.263(0.20)   2.287(0.19)   2.291(0.15)
HI^h, h = 1/2         -0.079(1.04)  0.1??(1.0?)   0.182(0.9?)   0.334(1.10)   0.442(0.6?)
  Geom                03%           04%           05%           10%           13%
  Indecisive          92%           92%           93%           88%           88%
  Pois                04%           [missing]     02%           01%           [missing]
HI^h, h = 1           0.186(1.14)   0.248(1.64)   0.378(0.90)   0.452(0.86)   0.617(0.73)
  Geom                05%           06%           04%           09%           11%
  Indecisive          92%           90%           95%           90%           88%
  Pois                03%           04%           01%           01%           01%

(? marks a character that is illegible in the source; the column placement of
the surviving entries in the first Pois row is uncertain.)

Table 5 : DGP=0.535x Pois(4)+0.465x Geom(0.2)



based on the penalized Hellinger distance, is closer to the mean of $N(0,1)$
than is the indicator $HI^{1}$.



7 Conclusion 



In this paper we investigated the problem of model selection using divergence
type statistics. Specifically, we proposed some asymptotically standard normal
and chi-square tests for model selection based on divergence type statistics
that use the corresponding minimum penalized Hellinger estimator. Our tests
are based on testing whether the competing models are equally close to the
true distribution against the alternative hypotheses that one model is closer
than the other, where closeness of a model is measured according to the
discrepancy implicit in the divergence type statistics used. The penalized
Hellinger divergence criterion outperforms classical criteria for model
selection based on the ordinary Hellinger distance, especially in small
samples; the difference is expected to be minimal for large sample sizes. Our
work can be extended in several directions. One extension is to use random
instead of fixed cells. Random cells arise when the boundaries of each cell
depend on some unknown parameter vector $\gamma$, which is estimated. For
various examples, see e.g., Andrews (1988b). For instance, with appropriate
random cells, the asymptotic distribution of a Pearson type statistic may
become independent of the true parameter $\theta_0$ under correct
specification. In view of this latter result, it is expected that our model
selection test based on penalized Hellinger divergence measures will remain
asymptotically normally or chi-square distributed.